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Abstract 

The genome of bacterial species is much more flexible than that of eu- 
karyotes. Moreover, the distributed genome hypothesis for bacteria states 
that the total number of genes present in a bacterial population is greater 
than the genome of every single individual. The pangenome, i.e. the set 
of all genes of a bacterial species (or a sample), comprises the core genes 
which are present in all living individuals, and accessory genes, which 
are carried only by some individuals. In order to use accessory genes for 
adaptation to environmental forces, genes can be transferred horizontally 
between individuals. Here, we extend the infinitely many genes model 
from Baumdicker, Hess and Pfaffelhuber (2010) for horizontal gene trans- 
fer. We take a genealogical view and give a construction - called the 
Ancestral Gene Transfer Graph - of the joint genealogy of all genes in 
the pangenome. As application, we compute moments of several statistics 
(e.g. the number of differences between two individuals and the gene fre- 
quency spectrum) under the infinitely many genes model with horizontal 
gene transfer. 
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1 Introduction 



Today, many prokaryotic species (i.e. bacteria and archea) are known to have 



highly flexible genomes (e.g. Tettelin et al. 2005 Ehrlich et al. 2005 Tet 



telin et aT] 2008 Koonin and Wol^ 2012). Unlike in eukaryotes, genes can be 



transferred horizontally (i.e. without a direct relationship between donor and 
recipient) between prokaryotic individuals of either different or the same popula- 
tion. As a result, gene content can differ substantially between strains from the 
same population. For example, the pathogenic strain E. coli 0157:H7 carries 



1387 genes which are absent in the commensal strain E. coli K-12 (Perna et al 



2001 ). This huge variation in gene content led to the concepts of the distributed 
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genome of bacteria and their pangenome (Tettelin et al. 2005 Ehrlich et al. 



2005). In datasets, genes present in all genomes of a taxon are called core genes 
while genes present in only some but not all individuals comprise the accessory 
genome. The latter set of genes is further split into the medium-frequency shell 



of genes and the cloud of genes of low frequency Koonin and Wolf (2008). 

In order to understand the growing amount of genomic data from bacterial 
species, classical population genetic theory - using mutation, selection, recom- 
bination and genetic drift as main evolutionary forces - must be extended in 
order to include realistic mechanisms of horizontal gene transfer (HGT). Since 
genomic data from prokaryotic species has become abundant only recently, HGT 



can in particular be seen as a newly discovered evolutionary factor (Doolittle 



1999 Koonin et al. 1997). Current estimates for the amount of genes which are 



horizontally transferred are at least as high as 32% (Koonin et al. 2001 Dagan 



and Martin 20071. It may even be argued that this number is still a lower 
bound because only a fraction of all events of HGT can be seen in data, either 
because the transferred gene is subsequently lost or the pattern is in accordance 



to vertical gene transfer (Gogarten et al. 2002) 



The tree of life has become a classical way of thinking about inheritance 
since Darwin's Origin of Species. However, the abundance of HGT counteracts 
the tree- like structures evolutionary biologists like to think about. Results are 
phylogenetic networks, which display at the same time the joint evolutionary 



fate of many genes (Huson and Scornavacca 2011 Dagan 2011), in addition 



to other reticulate events such as hybridization and incomplete lineage sorting. 
It is becoming clear that any genealogical tree of bacteria which have a flexible 



genome is at most a tree of 1% of all genetic material (Dagan and Martin 



2006 ) , which may eventually lead to a paradigm shift in evolutionary biology of 



prokaryotes (Koonin and Wolf 2012) 



Unraveling the amount of HGT shaping bacterial diversity can today be 
tackled using a growing amount of genomic data. In particular, several datasets 
from closely related strains, which are of the same bacterial species are avail- 



able today (Medini et al. 2005 Tettelin et al. 2005 2008 1 . In addition, many 



different approaches have led to a number of methods to estimate HGT rates 



and identify the corresponding genes ( 


Lawrence and Ochman 


2002 


Kunin and 


Ouzounis 


2003 


Nakhleh et al. 


2005 


Linz et al. 


2007 


Didelot et al. 


2010 


1. 



However, theoretical work on the population genomics of HGT is still in its 
infancy. In order to include HGT in population genetic models, some scenar- 
ios have to be distinguished according to the following basal mechanisms: (a) 
Transformation, which is the uptake of genetic material from the environment, 
(b) Transduction, which describes the infection of a prokaryote by a lysogenic 
virus (phage) which provides additional genetic material that can be built in 
the bacterial genome, (c) Conjugation, which is also termed bacterial sex., which 
requires a direct link (pilus) between two bacterial cells and leads to exchange 
of genetic material. In addition, small virus-like elements called Gene Transfer 
Agents (GTAs) have been found which may become even more important for the 
amount of horizontal genetic exchange in some species (McDaniel et al. 20101. 



Another mechamism of horizontal gene transfer are due to mobile genetic ele- 
ments like plasmids, gene cassettes and transposons, which transfer genes even 
within a single individual ( de la Cruz and Davies[ 2000). Most importantly, 
when considering horizonal gene transfer by (a), (b) and (c), transformation 
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(and to a lesser extend also transduction) transfers genes mainly from between 
distantly related species, while conjugation only works for bacteria from closely 
related species. 

A first approach of the population genomics of bacteria was made by 



Novozhilov et al. (2005), extending a model from Berg and Kurland (2002 1. 



Here, a birth and death process is used in order to describe the evolution of the 
frequency of a single gene under selection under within-population horizontal 
transmission ( "infection" ) , mutation (leading to loss of the gene) and population 
size changes. However, this study is limited since only a single gene is consid- 
ered, but bacterial genomes are comprised of several hundreds of genes, each 
of which may be under selection and horizontal gene transfer. In ' Mozhayskiy| 



and Tagkopoulos (2012), a simulation study was carried out, taking selective 



forces into account which arise from gene regulatory networks, i.e. epistasis 
of presence and absence of genes. Finally, Vogan and Higgs (2011) present a 



macro-evolutionary model and conclude that HGT was probably favorable in 
early evolution since loss of genes is frequent, but later, when genomes are rather 
optimized, HGT is not favorable and gene losses are rarer. 



In Baumdicker et al. (2010) (see also Baumdicker et al. 2012), we presented 



the infinitely many genes model, a population genomic model which includes 
HGT from different species, e.g. by transformation, but no HGT within species. 
It accounts for gene gain (or gene uptake) from the environment (at rate 0/2) 
along the genealogical tree which describes the relationships between the indi- 
viduals of the population. The term gene gain covers HGT from other species as 
well as gene genesis, because from the perspective of a species under considera- 
tion these two mechanisms are indistinguishable. Pseudogenization may lead to 
deletion of genes and is incorporated by gene loss (at rate p/2). The model uses 



the coalescent (Kingman 1982 Hudson 1983) as underlying genealogy instead 



of a fixed (phylogenetic) tree. On the latter, the same two mechanisms were 
studied already by Huson and Steel ( 2004 1 . A variant of the infinitely many 



genes model was introduced by Haegeman and Weitz (2012), who couple gene 
gain and loss events in order to obtain a genome of constant size. However, 
this is in contrast to available data, since flexible genomes of bacteria usually 
come with different genome sizes of up to 50%. An interesting extension of the 
infinitely many genes model was studied in Collins and Higgs (2012). Here, 



different random trees were used as underlying genealogies as well as different 
classes of genes, each class with its own rate of gene gain and loss. It was found 
that the coalescent produces a good fit with data, and it is likely that the rate 
of gene gain and loss depends on the gene. 

In the present paper, we extend the infinitely many genes model in order 
to incorporate events of intraspecies horizontal gene transfer. We stress that 
HGT in bacteria differs from crossover recombination in eukaryotes, since only 
single, non-homologous, genes are horizontally transferred in bacteria, while only 
homologous genomic regions are transferred by recombination in eukaryotes. 
Accordingly, since we aim at a genealogical picture of HGT in bacteria, the 



ancestral recombination graph (Hudson, 1983 Griffiths and Marjoram 1997) as 
an extension of the coalescent, cannot be used. Rather, we model HGT such 
that each gene present in the population comes with its own events of HGT, 
resulting in the Ancestral Gene Transfer Graph (AGTG). In the limit of large 
population sizes, we compute moments of several quantities of interest. The gene 
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frequency spectrum - see Theorera 
k out of n individuals. In Theorem 



Tl- describes the amount of genes present in 
21 we give our results for the expectations of 
the average number of genes per individual, and the average number of pairwise 
differences and the total number of genes, respectively. Calculations which give 
the variances of some of these quantities, can be carried out using the AGTG 
and are given in Theorem [3j 



The paper is organized as follows: In Section [2] we introduce the infinitely 
many genes model with horizontal gene transfer. After stating our results in 
Section [3] we introduce the AGTG in Section |4j Proofs of the Theorems are 
then given in Section [5] 



2 The model 

We introduce two different views on the same model. In this section, we describe 
a Moran model forwards in time, including events of gene gain, gene loss and 
horizontal transfer of genes. Later, in Section |4] we describe how to obtain the 
distribution of genes in equilibrium using a genealogy-based approach. 

We consider the following model for bacterial evolution: Each bacterial cell 
carries a set of genes and every gene belongs either to the core genome or to the 
accessory genome. The infinite set / := [0, 1] is the set of conceivable accessory 
genes and Gc with 5c H / = is the core genome. A population of constant 
size consists of N individuals (bacterial cells). We model the accessory genome 
of individual i at time i by a finite counting measure Gj^ {t) on /. We will 
identify finite counting measures with the set of atoms, i.e. we write u e Gf^ (t) 
if (5f (i), 1«> > 1- The dynamics of the model is such that (5f (t), 1«) < 1 for 
all i and u € [0, 1], almost surely. 

The population evolves according to Moran dynamics. That is, time is con- 
tinuous and every (ordered) pair of individuals undergoes a resampling event 
at rate 1. Here, in each resampling event between individuals i and j, one 
bacterium is chosen at random, produces one offspring which replaces individ- 
ual j (such that the population size stays constant). The offspring carries the 
same genes as the parent, i.e. if an offspring of i replaces j at time t, we have 
Gf {t) = Gf [t—)- In addition to such resampling events, the following (inde- 
pendent) events occur: 

1. Gene loss: For gene u e Gi^{t—) in individual i, at rate p/2, we have 
G^{t) = Gt'it-) \ {u}, i.e. gene u is lost from G^it). 

2. Gene gain: For every individual i, at rate 0/2, choose U uniform in [0, 1] 
and set Gf {t) = Gj^ {t—) U {J7}, i.e. every individual gains an (almost 
surely) new gene at rate 9/2. 

3. Horizontal gene transfer: For every (ordered) pair of individuals (i, j) and 
u G Gi' [t), a horizontal gene transfer event occurs at rate j/2N. For such 
an event, set Gf {t) = Gj'{t—)U{u}, i.e. individual i is the donor of gene u 
(the transferred gene) to the acceptor j. 

Horizontal gene transfer events can as well be written in the measure-valued 
notation as Gf{t) — {Gf{t—) + Su) A 1). The 'A 1 '-term indicates that we do 
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not model paralogous genes, i.e. horizontal gene transfer events have no effect 
if the acceptor individual j already carries the transferred gene. 

Definition 2.1 (Moran model with horizontal gene transfer). We refer to 
iGi^{t), ■ ■ ■ ,t/7v(*))t>o undergoing the above dynamics as the Moran model for 
bacterial genomes with horizontal gene flow. 

Remark 2.2 (Equilibrium). It can be shown that the Moran model of size 
N for bacterial genomes with horizontal gene flow is Harris recurrent and 
hence, has a unique equilibrium. We denote the corresponding limit by 
:=gf(oo),...,g^ -.^g^ioo). 

Remark 2.3 (Extensions of the model). Although the above introduced model 
for bacterial evolution with horizontal gene flow is among the most realistic mod- 
els for bacterial evolution available so far, several extensions are well possible, 
e.g.: 

• Gain, loss and transfer of multiple genes: The exact mechanisms of gene 
gain, loss and HGT are still under study. However, it seems clear that 
several genes can be gained or lost at once. 

• Gene families: Frequently, a single gene is present not only once but 
several times in a bacterial genome. The reason can either be a copying 
event along its ancestral line, or the gene is introduced by HGT although 
it was already present. 

• Gene synteny: The order of genes in the genome is called gene synteny. 
In our model, the synteny of genes is not modeled, but can be observed 
in genomic data. Above all, gene synteny can give hints of events of 
horizontal gene transfer, since the order of genes can be different in donor 
and acceptor. 

• Mobile genetic elements: There are parts of the genome like mobile el- 
ements which are more likely to be transferred horizontally. Examples 
are transposons, plasmids and gene cassettes, i.e. horizonral gene transfer 
even within a single cell can be considered. 

Remark 2.4 (Measure- valued version). In mathematical population genetics, 
one frequently studies models of finite populations forward in time, constructs 
their diffusion limit - most often a Fleming- Viot measure-valued diffusion - 
and only afterwards uses genealogical relationships in order to compute specific 
properties of the underlying forwards model. We take another route here and 
leave the construction of the infinite model forwards in time for future research. 
Here, one would have to consider the set of counting measures on [0, 1] as a 
type space (which is locally compact), and define the current state of a finite 
population as the empirical measure of types on this state space. Constructing 
the diffusion limit then gives the measure-valued diffusion. Since our interest 
lies in seeing the effects of horizontal gene transfer on summary statistics (see 
Theorems [lj|3]) , we do not follow this route within the current manuscript. 

We are mainly interested in large populations. The corresponding limit is usu- 
ally referred to as large population limit in the population genetic literature. 
The following result of the Moran model with HGT will already be useful in 
various applications. 
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Lemma 2.5 (The frequency path of a single gene). Let (t) be the frequency 
of gene u at time t in the Moran model for bacterial genomes with horizontal 
gene flow of size N with X^(0) such that {Q) x. Then, in the 

large population limit, N — > oo, the process [X^ {t))t>o converges weakly to the 
solution of the SDE 



dX={ -^X+ ^X{1 - X))dt + y'X{l-X)dW 
with X(0) — X for some Brownian motion W . 



(2.1) 



Remark 2.6 (The diffusion (2.1) in population genetics). The diffusion (2.1) 



also appears in population genetics models including selection (see e.g. Kimura 
1964 Ewens 2004^ ,Durrett 2008). In the present setting, the term propor- 
tional to X(l — X)dt appears because horizontal gene flow increases the fre- 
quency of the gene by a rate which is proportional to the number of possible 
donor/acceptor-pairs of individuals. 

Due to the close connection of horizontal gene transfer with selective models, 
a comparison to recent work is appropriate. In particular, the theory for the 
frequency spectrum in selective models with irreversible mutations is carried 
out in |Fisher| ( |1930[ ); [Wright] ( |1938[ ); [Kimura] ( ]1964l ]1969] ). We rederive these 
results in our proof of Theorem ]lj below, but we stress that the genealogical 
interpretation we give is derived with a special focus on horizontal gene flow, 
but not to the selective case. 



Proof of Lemma \2.^ First note that gene loss reduces X^ with rate propor- 
tional to NX^ and p/2. Second, horizontal gene transfer increases X^ with 
rate proportional to ^/{2N) and to the number of pairs where the horizon- 
tal gene transfer events has an effect, N'^X-^ {1 — X^). By construction, the 
evolution of frequencies of gene u is a Markov process with generator 

(G^/)(a;) - N{N - l)x{l - x) (^-f{x + l/N) + ^/(x - l/N) - f{x)) 

- ^if{x - 1/iV) - fix)) + ^^'l^^'^^h fix + l/N) - fix)) 



N- 



^ ix(l - x)rix) + i~P-x + 7x(l - x))f'ix) 



for / e C^([0, 1]). Using e.g. standard results from 
to show (weak) convergence to the diffusion (2.1). 



Ewens 



(2004) it is now easy 
□ 



3 Results on Summary Statistics 

Consider a sample , . . . , of size n taken from the Moran model of size N 
in equilibrium. We introduce several statistics under the above dynamics: 

• The average number of genes ( in the accessory genome ) is given by 

:= := -V igf I (3.1) 

n ^-^ 

2—1 

where \Qf'\ :— {Gf , 1) is the total number of accessory genes in individ- 
ual i. 
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• The average number of pairwise differences is given by 

j^(n) .^^jjin,N) 1 y (3.2) 

n(n - 1) ^ ' ' ' ' 

where Gi' \ Gf '■— {Gf — Qf )'^ are the genes present in i but not in j. 

• The size of the accessory genome is given by 

n 

fj(n) ._ Q{n,N) ._ I y gN (3 3) 

1=1 

where ljr=i ^i' ~ ( T^l=i ^i^) ^ 1 is the set of genes present in any indi- 
vidual from the sample. 

• The gene frequency spectrum (of the accessory genome) is given by G^""* := 
g("'^\...,g1"' g1"'^\ where 

Ci"^ := G^"^^' := |{u e / : u e for exactly k different (3.4) 

Remark 3.1 (Notation). In the following results, we will suppress the super- 
script N of the population size. Instead, we query that our results hold in the 
large population limit. E.g. if we say that (3.6) holds in the large population 
limit, we really mean that 



(1 + P)m / 

The proofs of all results presented here are given in SectionjSj For first moments, 
we provide proofs using diffusion theory and Lemma |2.5[ For second moments, 
we rely on the Ancestral Gene Transfer Graph (AGTG) of Section |4] Since 
the proofs of the results are either using Lemma [23] or the AGTG or both, we 
formulate the following three Theorems. 

Theorem 1 (Gene frequency spectrum). Consider a sample of size n taken 
from the Moran model for bacterial genomes with horizontal gene flow with 
p > 0,0 > 0,7 > in equilibrium. Then, in the large population limit, it holds 
that 

(»)fc f 1 + V — l^il^" 

with {a)i := a{a + 1) • • • (a + 6 — 1) and (a)t, a{a — 1) • • • (a — 6 + 1). 

Theorem 2 (More sample statistics). Under the same assumptions as in The- 
orem^ 



^ ' k n-l + p)i, V ^(n + p),-„m\ J ^ ' 



r 




(3.6) 
(3.7) 
(3.8) 
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2 4 6 8 



number of individuals, k 

Figure 1: The expected gene frequency spectrum from Theorem [l] is highly 
dependent of 7, the rate of horizontal gene flow. For high values of 7, most 
genes are in high frequency, leading to a closed pangenome. We use p = 2 in 
the figure. 



in the large population limit. 



Theorem 3 (Second moment of the number of genes). Under the same as- 
sumptions and in the large population limit as in Theorem [7], we have, in the 
limit 7 — > 0, 



7 1 / 1 



(l + p)(2 + p) 



(2)1 _ 



(l + /9)2(3 + 2p)(2 + 7p + 6p2) 
1 



l + p\2 (l + p)(l + 2p) V2(2 + p) 

2(12 + llOp + 248p2 + 209p3 + 60p'') 
^ (1 + p){2 + p)(l + 2p)2(3 + 2p)(2 + 3p)(6 + 5p) 



(3.9) 

(3.10) 
7) +0(7')- 



Remark 3.2 (Open and closed pangenomes). Recently, the concepts of open 
and closed pangenomes were introduced (Medini et al. 2005). If, after sequenc- 



ing a finite number of genomes, all genes present in the population are found, 
one speaks of a closed pangenome. If new genes are found even after sequencing 
many cells, the pangenome is open. It is not hard to see that high values of 7 
imply that most genes are in high-frequency. In other words, sequencing a new 
individual hardly leads to new genes which were not seen before. This impact 
of openness and closedness of the pangenome can as well be seen from Figure [T] 
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4 The Ancestral Gene Transfer Graph 



Since the seminal work of Kingman ( 1982 ) and Hudson] ( |1983[ ), the genealogical 



view is a powerful tool in the analysis of population genetic models. Here, 
we will give a genealogical construction in order to obtain the distribution of 
, ...Qn for ^ sample of size n e N in the large population limit of the Moran 
model with horizontal gene transfer in equilibrium. The resulting genealogy is 
denoted the Ancestral Gene Transfer Graph (AGTG). In this random graph, 
every ancestral line splits at constant rate 7/2 per gene due to a potential gene 
transfer event. We note that such events leading to potential ancestors are well- 



known for the ancestral selection graph (ASG) of Neuhauser and Krone (1997) 



and Krone and Neuhauser (1997). However, potential ancestors in the ASG 
arise by fitness differences within the population while the potential ancestors 
within the AGTG may take effect by events of horizontal gene transfer. 

We start with the construction of the genealogy for a single gene and come 
to the full picture including all genes afterwards. 

Definition 4.1 (The AGTG for a single gene). Gonsider a random graph An 
which arises as follows: Starting with n lines, denoted i = I, ...,n, 

• each (unordered) pair of lines coalesces at rate 1, 

• each line disappears at rate p/2 (meaning that the gene was lost), 

• each line splits in two lines at rate 7/2 (meaning that the gene was hor- 
izontally transferred from another individual, such that the gene can now 
have two different origins). 

Sample a single point E uniformly at random according to the length measure 
from the graph. (This point determines the time when the gene under consid- 
eration was gained.) For every line l,...,n, let Gi = 1 if there is a direct (i.e. 
increasing in time) path from i to E and Gi = otherwise. Then, (Gi, G„) 
is denoted the gene distribution of a single gene read off from the AGTG. 



A4 



Figure 2: In the 
construction of the 
AGTG for a single 
gene A4, start with 4 
lines at the bottom 
of the figure. Every 
pair of lines coalesces 
with rate 1, and ev- 
ery line splits at rate 
7/2 and disappears 
(marked by •) at rate 
p/2. 



For later use, we show that all moments of the length of the AGTG are finite. 
In particular, the length is almost surely finite, and the uniform distribution 
according to the length measure, from which E is picked, is well-defined. 
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Lemma 4.2 (Length of AGTG for a single gene has finite moments). 

Let An be the AGTG for a single gene from Definition \4.1\ and let L{An) be its 

length. Then, E[i(A„)''] < oo for all fc = 1, 2, ... 

Proof. The number of fines in An is a birth-death process with birth rate Xi — 
7^/2 and death rate 'fii = {!^) + pi/2 (when there are i fines) and as absorbing 
state. Since the length during times with i lines increases at rate i, L{An)/'2 is 
distributed as the hitting time T of of a birth-death process {Zt)t>o with rates 
Xi = 7 and fii = i — 1 + p, i = 1,2, ... and absorbing state 0. In order to show 
finite moments of T, note tfiat the process {Zt)t>o with Zt — Zt ~ I is bounded 
from above by a birth-death process with birth rates = A and death-rate 
fik — k. In other words, {Zt)t>o is the number of customers in an M/M/oo- 
queue. Let S denote the partial busy period of this queue (i.e. the first time 
when the queue is empty). Moreover, when Zt = we have that Zt = 1 and 
there is a chance p/ (A + p) that T is reached after an exp(A+p)-distributed time. 

From this construction, we see that T<Si + --- + Sn where Sk = S + S' , and 
S' ~ exp(A -I- p) independent from S, and N ^ geom(p/(A -|- p)), all S^s being 



independent. Hence, the assertion follows from finite moments of 5' (Artalejo 



and Lopez-Herrero 20011. □ 



We now come to the desired connection between the Moran model with hori- 
zontal gene transfer and the AGTG. 

Lemma 4.3 (Gene distribution of Mor an m odel and AGTG coincide). Fix u £ 

[0, 1] and let , be as in Remark 2.2. Then, for N — > oo, the distribution 



of [u), ...,Q^ {u) , conditioned on lJi=i^i^('^) converges weakly to the 



distribution of {Gi, G„) from Definition ^.1 



Proof. Consider the graphical construction of a Moran model with horizontal 
gene flow from Definition 2.1 run between times — oo and 0. Let Qf{—t) be the 



finite measure describing the genome of individual i at time —t. If we consider 
only a single gene u G [0, 1], we can use the following procedure in order to 
obtain the genealogy and distribution of gene u in , : 

1. Restrict the Moran model to (i) resampling events, (ii) potential gene loss 
events of gene u at rate p/2 per line, (iii) potential horizontal gene transfer 
events of gene u at rate j/2N per pair of individuals. 

2. Put gene gain events on all lines according to a Poisson point process with 
intensity {9/2)du. 

Clearly, by 1. we can determine the coordinates {i, —t) with the property, that 
u £ Uj=i ^j^(O) iff u G Gi^{—t). This subgraph is a random graph, which can be 
constructed from time backwards as follows: Starting with n lines, any pair of 
lines coalesces at rate 1, every line is killed at rate p/2 and every line splits in two 
lines (due to a horizontal gene transfer event) at rate ^{N — i)/2N if there are 
currently i lines within the graph. (For the latter rate, observe that the donor 
of gene u might already be part of the graph.) Hence, as iV — > oo, this random 



graph converges (weakly) to the AGTG for a single gene as in Definition 4.1 
In addition, for small e, genes in {u — e,u + e) are gained at most once on this 
graph. Hence, when conditioning on the event of a gene gain of gene u on the 
random graph (i.e. the Poisson point process has a point {x,u)), by well-known 
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4°^ 



A, 



(1) 



_t2. 



2} 



Figure 3: In the construction of the AGTG, start with the clonal genealogy of 
the sample (here of size 4), i.e. with Af\ Then, in order to obtain the genealogy 
of the first gene, construct ^4^'' by additional splitting events (at rate 7/2), loss 
events (at rate p/2), both marked by 1, and coalescence events. Iteratively, 
construct 4"^^' by keeping all hnes in Uf^Q4*'' ^-nd adding splitting, loss and 
coalescence events. In the three figures, the solid lines are the ones where - 
potentially - the corresponding gene can be gained. This means that gene 2 
is present only if the second gain event at time T2 occurs at a time which is 



("2) 

smaller than the sum of all (vertical) lines in A\ , i.e. smaller than L{A^ 
it occurs, it is put uniformly on the solid lines. 



(2)n 



If 



properties of Poisson processes, this event is uniformly distributed on the graph. 
In other words, the distribution is the same as that of E in Definition 14.11 □ 



While the construction of the genealogy of a single gene was straight-forward, 
considering all genealogies of all genes seems to be harder. The reason is that 
there can be infinitely many genes, and each of these genes comes with its own 
events of gene gain, loss and horizontal gene flow. Even worse, we can decide 
on the presence or absence of a gene only if we know if there was a gene gain 
event somewhere along the genealogy, which means that we have to follow all 
(uncountably many) potential genes back in time. 

We will resolve such difficulties by constructing (countably) many potential 
genealogies and model gene gain events along them. The result is the ancestral 
gene transfer graph (AGTG) for infinitely many genes. An illustration is found 
in Figure [3] 



Definition 4.4 (The AGTG for infinitely many genes). Consider a sequence 
of coupled random graphs which arise as follows: 



aW dl) d2) 



An^ is distributed according to Kingman's coalescent, i.e. starting with n 
lines, each (unordered) pair of lines coalesces at rate 1 (and the graph is 
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stopped upon reaching 1 line). 

Given An \ the random graph A^n^ gives all potential ancestors of gene 1 and 
is constructed such that: Starting with the same n lines as in An ^ ; 

• each line splits in a continuing and an incoming line at rate 7/2 (meaning 
that the gene was horizontally transferred from the incoming line). If the 
line was part of A^"^ , the continuing line runs along An'' as well. The 
resulting splitting event is marked with "1 " 

• each line is terminated by a loss event, also marked with "1", at rate p/2 
(indicating that gene 1 was lost). 

• each (unordered) pair of lines in {A^}'^ \A\?Y "^^^ ^'"^ 
coalesces at rate 1 (and A^n^ is stopped upon reaching 1 line). 

Given A^n \ A^n \ the random graph A^n^^'^ gives all potential ancestors of 
gene fc + 1 and is constructed such that: Starting with the same n lines as in 
A^°^ 

^n ? 

• each line splits in a continuing and an incoming line at rate 7/2 (meaning 
that the gene was horizontally transferred from the incoming line). If the 

line was part of [J^^^An K the continuing line runs along lJj=o"^"^ ^'^ 
well. The resulting splitting event is marked with "k + 1". 

• each line is terminated by a loss event, also marked with "k + 1", at rate 
p/2 (indicating that gene 1 was lost). 

• each ( unordered) pair of lines in {A[i^'^'^ \ Uj=o -^n^Y ^'^^ in lJj=o "^"^ ^ 

{A^~^^^ \ Uj=o-^"^) (coalesces at rate 1 (and the graph A^'^^'^ is stopped 
upon reaching 1 line). 

In order to model gene gain events, consider the events {Tm,Um)m=i,2.... of a 
Poisson point process on [0, 00) x [0, 1] with intensity measure ^9 dtdu (ordered 
by their first coordinate). For all k, 

• let L(A^n^) be the length of Ait^ . If T^^ < L{A^''), pick a point uni- 
formly at random according to the length measure on A^i^^ . (This point de- 
termines the time and line hen the gene under consideration was gained.) 

Finally, for every i = l,...,n, let Uk G Qi if there is a direct (i.e. increasing 
in time) path from i to Ek- Then, {GiT--,Gn) is denoted the Gene distribution 
read off from tlie AGTG in the infinitely many genes model. 

Remark 4.5 (Alternative way of distribution gain events on the AGTG). In 
the last step of constructing the AGTG, we used the condition Tk < L{Aih^) 

(k) 

in order to distribute a uniformly chosen point Ek on An . In distribution, the 
same result is achieved as follows: If T^ < choose a way of running 

through ^i*'-' along all paths at constant speed. Then, the gene gain event is 
placed after running length Tk. 

The following Lemma is the key element in the proofs given in Section [Sj 
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Lemma 4.6 (Gene distribution from Mor an m odel and A GTG coincide). Fix 

as in Definition \2. l\ and Remark \2.^ and (Qi, ...,t/„) 

as in Definition\4-4\ Then, 



{N}s 



(4.1) 



well 



7V\ -/V— »o Q 



(S)gn) 



(4.2) 



Remark 4.7 (Interpretation of (4.1 1 and (|4.2[) and a convergence criterion) 



1. Note that the space of finite (counting) measures on [0, 1] is equipped with 
the topology of weak (or vague) convergence. In addition, we will interpret 
a vector (^i, ^„) of counting measures on [0, 1] as a counting measure on 
{1, X [0, 1]. Henceforth, we write e^(U,'Li{z} x A,) = nr=i (Ai) 

for Ai, An S S([0, 1]) such that (4.1 1 is the same as 



((gf,/i),...,(c?W,/„))^==^((c?i,/i),...,(t;„,/„,)) 

for all /!,...,/„ eC([0,l]). 

Since 1 e C([0, 1]), ( |4.1| also implies the convergence of total masses 
of tjf . In addition, (4.2) is stronger because total masses of products, 
{Gi , 1) • • • {Gn ■, 1) converge as well. 



2. In our proof, we use the following convergence criterion from [Kallenberg] 
(2002 ), Proposition 16.17, here adapted for random measures on a compact 



space: 



Let C)?"'^iC^7 •■• be random counting measures on a compact 
metric space /, where ^ is simple. Then, ^„ ^~^°^ > ^, if (i) 
P(C„(A) = 0) P(C(^) = 0) for all open AC/ and (ii) 

limsup„_^oc E[f„(A)] < E[^(yl)] < oo for ah compact AQI. 



Proof of Lemma \4 (>\ We proceed in five steps. In Step 1, we define another 
set of models for a population of size N with horizontal gene transfer, indexed 
by K, in which / — [0, 1] is separated into K classes of genes, Af — [{i — 
1)/K;i/K),i — 1,...,K. For the resulting genomes, denoted (C/f^'^, CJ^'^), 
we show in Step 2 that the genealogies of {Gi''^ , ■■■,Gn'^) £^re given by an 
AGTG with K + 1 coupled random graphs. The construction of these random 
graphs can be re-ordered such that the limit K ^ oo can be taken easily; see 
Step 3. In Step 4, we let iV — > oo and show the convergence of the coupled 
random graphs to (A'^^^ A^^\ . . .) , implying the convergence to (5i,...,^n). In 
the last step we show convergence of second moments. 

Step 1: Definition ofG^'^: Fix K eN and set Af := [{i - l)/K;i/K),i = 
1, K. We define another Moran model (called MorauA-model) with horizon- 
tal gene transfer. Briefly, in this model, all genes u € Af^ follow the same gene 
loss and gene transfer events. Precisely, in addition to resampling events (at 
rate 1 for every ordered pair of individuals), the following events occur: 
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1. Gene loss: For all k — l,...,K, a gene loss event occurs at rate 7/2 
per individual. Upon such an event in individual i, we have Q^'^ {t) = 
Gf'^{t~) \ A^, i.e. all genes u G are lost from Qf'^lt). 

2. Gene gain: For every individual i, at rate 0/2, choose U uniform in [0, 1]. 
If C/ e Af , set &f '^(i) = (C'^(i-) \ Af ) U {U}, i.e. U is the only gene 
in^?f-^(t)nAf. 

3. Horizontal gene transfer: For every (ordered) pair of individuals (i, j) and 
k — l,...,i4r, a horizontal gene transfer event occurs at rate ^/2N. For 
such an event, set ' (t) — Gj ' {t—)U{Gi' )nA^), i.e. individuali 
is the donor of all genes u G Q^'^ [t—) n A^ to the acceptor j. 

Again, {Gi'^{t),.-.,G^'^{t))t>o is a Harris recurrent Markov chain. We 
start it at time —00 and thus obtain the equilibrium measures {Gi'^ '■= 
tyf'^(O), := G^^'^'m by time 0. 

Step 2: {Gi^'^ , ■■■,Gn'^) can be constructed using K +1 random graphs: Recall 
the construction of the AGTG for a single gene from Definition |4.1| We extend 
this construction in order to obtain the distribution of (^f'^, ...,Gn'^)- Since 
K is finite, we can proceed by a two-step procedure similar to the proof of 
Lemma [473] in the Moran^ model. Here, we first generate resampling, gene loss 
and transfer events and only afterwords introduce gene gain events. So, first 
consider a Moran model with (i) resampling events, (ii) potential gene loss events 
for genes in A^ with rate p/2 along all lines, where a transition from Gi ' (t) 
to G^'^it) \ occurs, k — 1, ...,K and (iii) potential gene transfer events of 
genes in A^ with rate ^ /2N per pair of individuals, as in MorauA, fc = 1, K . 
Next, introduce gene gain events for all lines and all Ak,k — 1,...,K at rate 
6/2K, where each new gene is assigned a uniformly distributed random variable 
on Af . 

Equivalently, as for the AGTG for a single gene, we can start from time 
backwards and construct K + 1 random graphs such that graph k describes the 
possible ancestry of genes in A^,k = 1,...,K. Precisely, start with graph 0, 
which is a coalescent started with n individuals (without gene loss and horizontal 
gene transfer events). In graph 1, add gene loss events, valid for all u G Af 
and gene transfer events, which lead to splits of lines in the graph at rate 
7(A^ — m)/2N, if it currently has m lines. In addition, at rate ^m/2N , the 
split of a line leads to ancestry to a line which is already within the graph. 
Iteratively, in graph fc, additional loss and split events, valid for genes u G Af , 
occur. Again, a split might origin from a line which was already present in 
graph 0, ...,fc — 1, and otherwise gives a new line. These graphs are denoted 

.(0) j{K) 

After having constructed all K + I random graphs, graphs 1, ...,K are hit 
by gene gain events, each with rate 9/2K. As above, each new gene in graph k 
is assigned a uniformly distributed random variable on Af . In each Af , keep 
only the gene which is closest to time 0, since in MorauA, a new gene in Af 
overwrites present ones. By this procedure, we can read off {Gi^'^\ Gn^''^^) 
from the random graphs, which are marked by gene gain events. Note that 
Gf'^{A^) < 1 and that graphs 1, ...,K are exchangeable by construction. 
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Step 3: {G^'^ , ■■■,Gn'^) {Gi T ■■■,Gn) '^''^d the limit can be constructed 

using infinitely many random graphs: In the construction of the last step, we 
reverse the order of generating gene gain events and the random graphs. First, 
let {Tjn, Um)m=i,2.... bc the points in a Poisson point process T on [0, oo) x [0, 1] 
with intensity ^dtdu, ordered by their first coordinates. Instead of construct- 
ing the random graphs 0, K + 1 in the order of the intervals A^, in 
[0, 1], we can as well construct the random graphs in the order of appearance 
of gene gain events. Formally, let Km '■= k if U„i G A^, i.e. Km gives the 
number of the interval A^ in which the mth mutation {Tm,Um)m=i,2.... ap- 
pears. Then, let ii := 1 and v+i := inf{m > v ■ Km 4- {^ii -f'^m-i} for 
r < K. This means that i^^j, Ki^ is the number of intervals in the order of 
the appearance of the first gene gain within each interval. Most importantly, 
M^°^ a'^^^ A'^^^ ) - A^^'''' .4^^'^ ^ since the Poisson 

point process T is independent of (-4^^^ ^, Ai: 

TV. a:) ^^'^ ^^"^ random graphs 

(.■^n^N,K^ ■■■^A!^n,k) are exchangeable. 

Now, consider gene gain events on A^^j^^. By construction, the first gene 
gain event at time Ti falls into Aj^^ . Hence, this graph is hit after an expo- 
nentially distributed time with rate 6/2. (Note the difference to the rate 6/2K 
from the last step.) In order to model this, take Ti (the time of the first gene 
gain in T), determine a set of paths how to move through A[^^^ and place a 
gene gain event after time Ti . In case the length of A^f/^^j^ is smaller than Ti , 

do nothing. Continuing, by construction, A^^^^j^ (recall Ki^ = K2 if K2 7^ Ki) 
is hit by a gene gain event, this event occurs by time T2 of T. Again, determine 
a set of paths how to move through A^^^^j^ and place a gene gain event after 

time T2, if possible. Continue until Tij^ and A!^^^^j^- 

In this construction, we can now let ii' — !■ cx), which means that we construct 
infinitely many random graphs, A^^^,A^^''J^T.■ such that the first K + 1 are 

distributed according to (.4|°]y ^, A^^/^]^, . . . , aI^^'^^) . On these infinitely many 
random graphs, we use all points {Tm, Um) in order to construct G^ ■ 

We now show that G^'^ k^oo^ ^ gj^ (where the latter is constructed from 
the infinitely many random graphs) as well as G^'^'^ k^oo^ ^ gj^ (the latter being 

the set of genomes from the Moran model), implying that G^ = G'^ , i.e. the 
genomes G^ can be constructed from infinitely many random graphs by using all 
points in T. We use the criterion from Remark 4.7 For the convergence to G^ , 



note that both, G^'^ and G^ can be constructed on a joint probability space, 
using the same (infinitely many) random graphs. The difference in construction 
is that for G^ , all points in T are used, while in G^'^ only the first points 
within each Af are used. Moreover, as long as at most one gene gain event hits 

Ai! 

.jv K random measures G'^'^ and G'^ agree on Af . Hence, we write, for 



15 



any Borel set A C {1, n} x [0, 1] and fc = 0, 1, 2, ... 

K 

< -4jv^K hit by 2 gene gain events 
1=1 

< K ■ P{A^j^^j^ hit by 2 gene gain events) 



(4.3) 



^^^^ 

,(l)xx2l i^^g 



< • E[l - exp{-eL{A'j^>)/2K)il + 9L{A'^^'^)/2K)] 



< -—niL{A'^')r] 0, 



implying (i) of Remark 4.7 2. Now, let L(y^J^]^ j^) and L{A^i l^) be the lengths of 



the random graphs -4^^]^ ^ and ^i^^, respectively and note that A^^lj = A^^ ]^. 



By construction, we write 

ef([0,l])=ElL(^S)>T, 



if 



(4.4) 



m—l 



Then, by exchangeability, for a rate-l-exponentially distributed r ando m variable 
X and E[L(yt^|^_^)] = E[L(y^^^]^)] < E[L(y^i)] < oo by Lemma 



4.2 



E[g^'^({l, n} X [0, 1])] ^ n • E[gf '^([0, 1])] 



= nX.P(L(^«^^)>^X) 
= • E[l - cxp{-0 L{A[^^j^ if)/2if)] 



(4.5) 



K^oo 



,n9-E[L{A[]>j,)]=n-E[g^ {[0,1])] 
E[g^({l,...,n}x [0,1])], 



which gives (ii) of the convergence criterion given in Remark 4.7 2. (Note 



that the finiteness of the right ha nd s ide of the last equation can be seen 



from E[L(^^"'^]^] < oo; see Lemma 4.2 ) Next, we come to the convergence 



gN,K gN ^ Again, we observe that both random measures can be con- 

structed on one probability space. Here, use the K + 1 random graphs in order 
to construct G'^'^ first and draw them as a part of a graphical construction of 
the MorauA-model, starting at time 0. Note that in the MorauA-model, gene 
gain events for a gene u e can lead to loss of another gene v G A^, if the 
line of the gene gain event carries gene v. For such genes, which are lost in 
the MorauA-model, put additional gene loss and transfer events in the (regu- 
lar) Moran model. Again, we claim that Q^'^ — if every random graph 



A^^^ A 



n^N K hit by at most one gene gain event. Hence, the same 

K- 



calculations as in (4.3) and (4.5 1 gives the convergence G^'^ 
well. 



>oo 



as 



16 



N- 



Step 4- {Gi , ■■■,Gn ) > {Gi, ■■■,Gn), constructed from infinitely many ran- 

dom graphs: By now, we have shown that (t/f , Gn) can be constructed from 



infinitely many random graphs W'^'' , , . 



such that A'"^^ is a Kingman coa- 
lescent started with n hnes, -4^^^'' has additional coalescence events and split 
events at rate 7(iV — k)/2N, if there are k lines in graphs A'"^\ ...-4^''. Now, as 
N — > oo, the splitting rate converges to 7/2. By weak convergence of the ran- 

> {Gl, Gn)- 



dom graphs, the genomes converge as well, i.e. {Gi , ■■■,Gn) =^~*°^ 



Precisely, we again have to check (i) and (ii) of Remark 4.7 2. For (i), first note 
that 



(^.([0,1]) > C) < ^nGkiiO,!])] = 4^E[L(^W)] 



2C 



>0, 



'(^A^N* ^Pli^ event at rate ^m/2N^ 



(4.6) 



= 1 - E[cxp(-7L(^W)/2iV)] < E[7L(^^0/2A^)] 



(1)^ 



7V->-oo 



> 0. 



according to Lemma 4.2 So, we write for A C {1, n} x [0, 1] 

\P{G{A) = k)-V{G''{A) = k)\ 

c 

< PiGiA) >C)+P(^[J A^^^ hit by split event at rate jm/2N 



i=l 



(4.7) 



< PiG{A) >C)+C- P{A^N^ hit by split event at rate 7m/2iV) 



PiG{A) > C) 



by ( |4.6[ ) implying (i) of Remark 4.7 2. For (ii) (again noting that the same 
calculation holds for arbitrary compact A C {l,...,n} x [0,1]), we have, 
since {L(A].^))n=i,2,... is uniformly integrable, by standard arguments (see e.g. 
Bilhngsleyl ( |l999[ ). Theorem 3.5), 



E[G''{{l,...,n}x[OA])]=n0-nHA^lU 



> nO ■ E[L{A[^^)] = E[G{{1, n} x [0, 1])] < 00. 



(4.8) 



Step 5: Convergence of moments: The calculations are similar to (4.7) and (4 
We only have to deal with finitcness of m omen ts in order to show Gi^ 
r.N H 



^Gii 
know that {L{A 



(l)^ 
N ) 



L{A^^'^))n=i,2,... is uniformly integrable by Lemma 



Gj- Here, (i) of Remark 4.7 2 is implied by (4.7). For (ii), we 



(sinceL(^«)---L(^(;')) <L(^W)" 



4.2 



integrable by Lemma 4.2) and L(ylj^'')' 
Hence, 



+L{A^ and the latter is uniformly 



.L(A"^) 



TV-!. 00 



LiAi)---L{An 



E[c;f ®...®g/:(({l,...,n}x [0,1]))"] = 
f; • E[L(A) • • • HAn)] = nGi 

< 00. 



E[i(^W )...L(^f^)] 
■•0e„(({l,...,n}x [0,1])")] 



(4.9) 

□ 
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5 Proofs of Theorems [IHS] 
5.1 Proof of Theorem [T] 

Using diffusion theory and Lemma |2.5[ we obtain first moments of all of the 
statistics G^^\ . . . , Gn"' in equilibrium. Moreover, the statistics as considered 
in Theorem [2] are linear combinations of G'^"\ . . . , G^"''; see the first proof of 
Theorem [2] below. 

We consider the diffusion (2.1 ) with infinitesimal mean and variance 

= —^x + ■^x{l — x), cr^(a;) = a;(l — x). 

The Green function for the diffusion, measuring the time the diffusion, i.e. a 
gene, spends in frequency x until eventual loss, if the current frequency is S < x, 
is given by 

GiS,x) ^ 



a'^{x)'ilj{x) 
where 



Hy) ■■= exp (^-2^' ^rf^) - (1 - y) 

:= / ip{y)dy. 



^-Pg-iy 



Following Durrett (2008), we introduce new genes in frequency 5 ^ 1 at rate 
in a consistent way. That is, the gene raises in frequency to e > S with 

probability ^^fy- Hence the number of genes in frequency x is Poisson with 
mean 

^0(^^^'^'^^^^(l-x)i-p- 
The gene frequency spectrum is now given by 



n\ ,, ,,T(n — k + n) ^ 



k{n-l+ p)k\ {n + p)fnm\ 

where iFi(k; n + p; 7) = 1 + ^ {ri+7)'- ml ^ hypergeometric function. 

rn—l 



5.2 Proof of Theorem [2] 

We give two proofs, one using diffusion theory and Theorem [TJ one using the 
AGTG from Section H 
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Proof of Theorem^ using Theorem^ Given the expected gene frequency spec- 
trum from Theorem [l] it is now easy to compute first moments of A, D and G 
by using, in the infinite population hmit, 



g(") = \g,\ + \g,\g,\ + .. 



Gn \ U 



(5.1) 



such that 



= E[AW]=E[gW]-- 1+^ 



e k 



E 



oo 



— ; fc fc - 1 + p ^ (fc + 



1 

1 . ^r^ 7 



m+1 

n-1 



E7™E 

m=0 /c=0 
ri— 1 



m 

m=l fe=0 



1 1 

(fc + p)„-, ~ (fc + l + p). 



1 



□ 



Proof of Theorem\^ using the AGTG. First wc note that v4(i) = G^^) and 
= UliSi'^ ^G?^)\ G^'H + \{G['^ U \ such that 

E[A(")] =E[A(i)] =E[G(i)], 

E[Li(")] = E[i:i(2)] ^ e[G(2)] - E[G'-% 

and it suffices to compute E[G^"''] in the proof. Wc will abuse notation and 
write dx and dy for infinitely small portions of the genome. In order to compute 
E[G(")], the idea is to write := (^"^^ G,) A 1) and 



E[G(") 



E 



G'\dx] 



E[e"(dx)] 



, -E[LiAn]dx ^ -E[L{Ani 
lo z z 

(5.2) 

such that we have to compute the expected length of A", the AGTG for a 



single gene, which we denote by L{A"). Recall from the proof of Lemma 4.2 



that L{A^)/2 has the same distribution as the hitting time T of of a birth- 
death process {Zt)t>Q with birth rate = 7 and death rate fii — i + p — 1. 
Now, it is well known (see e.g. Karlin and Taylor 1975 1 that 



n-l 



E[LiAn]=E[T\Zo = n]=J2p^ + T.(EYj E 

i=l k = l \r=l / m=k+l 



(5.3) 
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where 



Ai • • • Ai_ 



7 



i-i 



r 



Combining (5.2) and (5.3) yields 



o 2 



7 



m — l — k 



r 



n— 1 oo 



E 



oo ri-1 



EE 



in=l fc=0 
n-1 



ri-1 



V ^ + ^™ 1 

fc=0 m=l fc=0 

n-l ^ oo n-1 . 



fc=0 
n-l 



„=1 "1 ^ V (P + fc)™ ip+k+ l)rn 



7™ / 1 



k=0 



+ P ^l'^ \{p)r-n {n + p)r 



(5.4) 



(5.5) 



According to (5.1), E[yl("'] is readily obtained and the expected number of 



differences is given using by 



E 



- = 7^(l+E ^ 



(2 + pU 



□ 



5.3 Proof of Theorem H 

Again, we rely on the AGTG and use approximations if the splitting rate 7 is 
smah. Since A^^) = j^Giidx), we write 

v[A(i)]= / v[gi{dx)]+ f f u^yCov[gi{dx),g,{dy)]. (5.6) 

Jo Jo Jo 
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First, given (the AGTG for a single gene), the expected number of gene gain 
events on is Poisson distributed with parameter {9/2)L{A^)dx such that 



¥[\gi{dx)\] =¥[E[\gi{dx)\\A^]] +E[Y[\giidx)\\A^]] 
= Y['^L{A^)dx] + E[E[\gi{dx)\ \A^]\ 

= ^E[L{A^)]dx + 0{dx^) = E[A^^'^]dx + 0{dx^) 



(5.7) 



7(2 + p)+7^ 
(l + p)(2 + p) 



+ 0{^^)^dx + 0{dx^) 



Second, for x 7^ y. 



cm[\gi{dx)i\gi{dy)\] = 



\^[\g^{dx)\\A\A\E[\g^{dv)\\A\A^]] 

+ E[cm[\g^{dx%\g^{dy)\\A\A^] 

Cm[lL{A^)dx, lL{A^)dy\ 



= —<Cm[L{A^),L{A^)]dxdy 

since |^i((ia;)| and |5i((i?/)| are independent given A^,A^. Now we compute 
COY [L { A^ ), L{A'^)] up to second order in 7. For this computation, we make 
use of the fact that the AGTG for two genes can be defined in analogy to the 
AGTG for a single gene from Definition |4.1[ but with two different kind of loss 
and transfer events. Precisely, we consider the following random graph: starting 
with x lines of state only gene 1, y lines of state both genes and z lines only gene 
2, pairs of lines coalesce at rate 1. (Note that coalescence of a line of state only 
gene 1 and a line of state only state 2 gives a single line of state both genes.) 
Lines where gene 1 (gene 2) is considered are lost at rate p/2. (If a line of state 
only gene 1 {only gene 2) is lost, it is lost completely, while if a line of state both 
genes is lost, it turns into a line of state only gene 2 (only gene 1).) Finally, 
every line of state only gene 1 [only gene 2) is split at rate 7/2 and the new 
line is again of state only gene 1 [only gene 2). In addition, a line of state both 
genes splits at rate 7 and the new line is of state only gene 1 and only gene 2, 
both with probability 1/2. The length of the graph of lines at states only gene 
1 [only gene 2) and both genes is denoted Li(t) {L2{t)) if a sample from time 
t of the population is considered. We write Ea;yz[.] for the expected value if the 
process is started as above. 

It is important to note that E[L{A^)L{A'^)] = Eoio[Li{t)L2{t)] for any t, 
since the AGTG describes the population in equilibrium. In order to compute 
IEoio[^i(0^2(i)], we use a time derivative and write 

Eow[Li{t + dt)L2{t + dt)] = (1 - (7 + p)dt)EQio[{Li{t) + dt){L2{t) + dt)] 

+ jdt ■ Eiio[(Li(<) + dt){L2{t) + dt)] + pdt-Q 



Using that the AGTG is in equilibrium and ignoring effects of order dt^ , we 
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obtain (for Li :— Li{t), i — 1,2) 

(7 + p)Eoio[LiL2] = Eoio[ii + ^2] + 7 ■ Eiio[iii2], 
(1 + |p + |7)Eno[iii2] = Eno[ii + 2L2] + 7 • E2io[LiL2] + ljEni[LiL2] 

+ \p ■ Eioi[£ii2] + (1 + Ip) ■ Eoio[iii2], 
(3 + 2p)E2io[iii2] = E2io[ii + 3L2] + (3 + p) • Eno[iii2] 

+ ip-E20i[LiL2]+O(7), 
(3 + 2p)Eni[iii2] = Eni[2ii + 2L2] + Eo2o[iii2] 

+ (2 + p) • Eiio[LiL2] + P ■ E2oi[iii2] + 0(7), 
(1 + 7 + /9)Eioi [L1L2] = Eioi[ii + L2] + Eoio[iii2] + 7 • E201 [^1^2], 
(3 + |p)E2oi [L1L2] = E201 [ii + 2L2] + (1 + p) • Eioi [LiL2] 

+ 2-Eno[LiL2] + 0(7), 
(1 + 2p)Eo2o[iii2] = Eo2o[2ii + 2L2] + Eoio[LiL2] 

+ 2p-Eiio[iii2]+0(7)- 

(5.8) 

Note that some terms 0(7) were written which will not lead to the first two 
leading terms in Eoio[Li{t)L2{t)]. The expectations Ej[Li] for i = 1,2 and 
j e {010, 110, 101, 210, 111, 201} can readily be computed using 



E,y,[Li] = ElLiA^'+y)] and E,^,[L2] = E[L(^^+^)] 



We use from (5.51 that 



E[i(^")] = ^ 



27 



+ 7 



n{n + 2p+ 1) 



fc=0 



k + p p{n + p) p{p + l){n + p){n + p + I) 



(5.9) 



such that 



E[L{A^)] 
E[L{A')] = 
E[L{A^)] = 



P 



1 + P 



+ 7^ 



p(p+l)(p + 2) 



P(l + P) 
6p2 + lOp + 4 
p(p+l)(p + 2) 



l + 2p + 27j +0(7'), 
Oh). 



Solving (5.8) using (5.9) and (5.10) gives 



im2 



COY[LiA'),L{A')]^Eoio[LiL2]-E[LiA')] 



p(l + p)2(3 + 2p)(2 + 7p + 6p2) 



7' + 0(7')- 



Combining (5.11) with (5.7) and (5.6) gives the result. 



(5.10) 



(5.11) 



To compute the variance for the number of differences D^^^ we will use a similar 
approach. Setting 2?i,2 := {Gi - ^^^)+ + - Q^^)^ we can write 2D^^'> = 
Jq T^i,2{dx) and thus 



1 /.I 



V[2Z?(2)]^/ Y[Vi^2{dx)]+ / U^yCm[Vi^2{dx),Vi^2{dy)]. (5.12) 



"'0 
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Let A2 be the subgraph of ^2*"* consisting of branches leading to either individ- 
ual 1 or individual 2 but not to both. If ^(^2) denotes its length, given 
the expected number of gene gain events leading to a difference between two 
given individuals is Poisson distributed with parameter {9/2)L{A2) such that 



(compare with (5.7)) 



V[|Pi,2(dx)|] =V[E[|2?i,2(d2;)||4^^]] +E[V[|Pi^2(dx)||4')]] 
= W[lL{Al)dx] +E[E[\Vi^2idx)\ |4'^]] 



(2)lWT.4^rOM^2^ (5.13) 



-E[L{Al)]dx + 0{dx^) = -E[2D^'^^dx + 0{dx 
^ ^ ■ '-0{^^)+0{dx^). 



2 + 3p + p 



2 



In the same way as seen below equation (5.7 1 we obtain, for i ^ j 
COY[\Vi^2idx)\,\Vi^2idy)\] - ^COY[LiA^),L{Ai)]dxdy. 

As E[_L(yl2)] • E[L(^2)] is already known the remaining part is to compute 

32 



^[^(-^^)^(-^^^)] - (l + p)(l + 2p) 

32(48 + 314p + 611/92+ 46V + 120p'')7 , ^^^2^ 



(1 + p){2 + p)(l + 2p)2(3 + 2p)(2 + 3p)(6 + 5p) 

(5.14) 

For that we wiU split {A^2\^2^) into two parts, T(4\4^^) and S'(4\4''')- 
Recall that there are three different types of events in (^2*\4''^)' namely loss, 
merging lines and splitting hues. The first part, T(4*\4''^)i contains solely 
the times Ti,T2, . . . between these events, while the second part, 5'(4*\4"''') 
contains the remaining information from {A^,A'^) on which lines split, merge 
and get lost, i.e. it is possible to describe the structure/topology/shape of 
the AGTG from S'(4\4^^)- Note that given S'(4\4^'), the times Ti = 
Ti(S'(4\4^^))>T2 = T2(5(4\4^^)):- are independent exponentially dis- 
tributed random variables with rates measurable with respect to S'(4*\ -^2''')- 
In particular, the number of lines between the fcth and {k + l)st time in 
T(4\4^^)> which lead to either one or the other of the individuals, but not 
to both, denoted by Dl = i:>^,(S'(4\ 4^^))> is S'(4\ 4^ ^measurable and 

L{A^)=Y,Din (5.15) 

A; 

Let S be the space of all possible shapes which can be taken by S'(4\4^^) 
and let 5-^2 := {s e 5 : P(s) ^ 0(72)}, i.e. Sj2 contains all shapes which have 
at most one splitting event. Within 5^2, there are at most 8 events before 



23 



{A2\A2 ^) has lost all lines, so we can write 



E[L(4*))L(4^"') 



^ p(.).E[^D^T,.^i^i:r,i5(4\4^")) 



fc=i 



fe=i 



Oh') 



fc=i 



l<fe,fe'<8;/c^/c' 

As Sj2 has more than 5000 elements we used Mathematica to compute P(s) - see 
the accompanying file - the variables Z?^(s), resp. Z?^(s), and the parameters 
of the exponentially distributed times Tfe(s) for 1 < fc < 8 and all s £ Sj2. 
Combining (5.131 and (5.14) gives the result as shown in (3.10). 
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